Add MPI to driver code #34
Conversation
All changes were made in the driver code, and no changes should be required in the different implementations. The changes involved:
1. Initialising MPI and getting the rank and size.
2. Guarding std::cout so only rank 0 prints.
3. Adding barriers between kernels.
4. Performing a reduction for the dot product.
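A minimal sketch of what these four driver changes could look like, assuming a simplified driver (the variable names and structure here are illustrative, not the actual BabelStream code):

#include <mpi.h>
#include <iostream>

// Illustrative only: the shape of the four changes described above.
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);                  // 1. Initialise MPI ...
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    //    ... and get the rank
  MPI_Comm_size(MPI_COMM_WORLD, &size);    //    ... and the size

  if (rank == 0)                           // 2. Guard output so only rank 0 prints
    std::cout << "Running on " << size << " MPI ranks" << std::endl;

  // ... run a kernel on this rank's local arrays ...
  MPI_Barrier(MPI_COMM_WORLD);             // 3. Barrier between kernels so ranks stay in step

  double local_dot = 0.0;                  // result of this rank's dot kernel
  double global_dot = 0.0;
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);           // 4. Dot product performs a reduction across ranks

  MPI_Finalize();
  return 0;
}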
Looks fine overall. Successfully ran across 256 nodes with decent efficiency.
I've made some minor suggestions w.r.t. output, but nothing particularly important.
What are we going to do about the Makefiles? Do we want some standard way of enabling MPI? For OpenMP you can do it without touching the Makefile, but other models like CUDA need some work.
I can imagine a couple of possible ways of doing this. There could be some variable the user can set:
make -f OpenMP.make MPI=1
Or we could have a special target in each Makefile:
make -f OpenMP.make mpi
Or something else. Not really sure what's best, just brainstorming here.
main.cpp
Outdated
std::streamsize ss = std::cout.precision();
std::cout << std::setprecision(1) << std::fixed
  << "Array size: " << ARRAY_SIZE*sizeof(T)*1.0E-6 << " MB"
  << " (=" << ARRAY_SIZE*sizeof(T)*1.0E-9 << " GB)" << std::endl;
std::cout << "Total size: " << 3.0*ARRAY_SIZE*sizeof(T)*1.0E-6 << " MB"
  << " (=" << 3.0*ARRAY_SIZE*sizeof(T)*1.0E-9 << " GB)" << std::endl;
std::cout.precision(ss);
}
Maybe worth including the phrase "per rank" in these printouts when in MPI mode, to make things crystal clear?
Could also show the actual total size, but not sure how interesting that is.
Is there a nice way to alter the string depending on the USE_MPI preprocessor define?
Not really; I think this is about as good as it can get:
std::cout << std::setprecision(1) << std::fixed
#ifdef USE_MPI
<< "Array size (per MPI rank): "
#else
<< "Array size: "
#endif
<< "...";
Fixed in cb92f94
@@ -179,6 +237,10 @@ void run()
check_solution<T>(num_times, a, b, c, sum);
I think this needs to be in an #ifdef USE_MPI / if (rank == 0) section as well - otherwise you get massively spammed if verification fails.
On a related note, we really need to sort out verification for the reduction as per #20, since this almost always fails now when using a large number of MPI ranks.
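Presumably the suggestion amounts to something like the following (a sketch only; rank, the arrays, and check_solution are assumed to exist in the surrounding driver code):

// Sketch of the suggested guard: only rank 0 checks and reports, so a
// verification failure is not printed once per MPI rank.
#ifdef USE_MPI
  if (rank == 0)
#endif
  {
    check_solution<T>(num_times, a, b, c, sum);
  }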
@@ -187,6 +249,7 @@ void run()
  << std::left << std::setw(12) << "Average" << std::endl;

  std::cout << std::fixed;
}
This is fine, but might be interesting to see these results per rank as well as the aggregate bandwidth. This would make it easier to see how well things scale when running on lots of nodes.
One could always run the benchmark with -np 1 (or run the non-MPI benchmark). We could also run the benchmark on a single node first, and then print out the efficiency. This would probably need a bit of an overhaul in the way the driver works though. Not sure if it's worth the pain.
I just meant that it might be nice to print out the average per-rank bandwidths, I'm not talking about actually re-running on a single node.
e.g. If I run on a single P100 I get 550 GB/s. When I run on a cluster of 2000 P100s, I get some very large number, but do I still get 550 GB/s per rank?
To be honest I'm just being lazy - it's trivial for the user to open a calculator and do that division themselves. Feel free to ignore this comment :-)
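Purely as an illustration of the kind of printout being discussed, a hypothetical helper could do that division in the driver (the names and structure here are assumed, not part of the PR):

#include <iostream>

// Hypothetical: given the bytes one rank moves per iteration and its fastest
// time, print the aggregate bandwidth alongside the per-rank figure.
void print_bandwidth(double bytes_per_rank, double min_time, int num_ranks)
{
  const double per_rank_gbps  = 1.0E-9 * bytes_per_rank / min_time;
  const double aggregate_gbps = per_rank_gbps * num_ranks; // weak scaling: each rank owns its own arrays
  std::cout << aggregate_gbps << " GB/s total ("
            << per_rank_gbps << " GB/s per rank)" << std::endl;
}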
- double average = std::accumulate(timings[i].begin()+1, timings[i].end(), 0.0) / (double)(num_times - 1);
+ double average = std::accumulate(timings[i].begin()+1, timings[i].end(), 0.0);

+ #ifdef USE_MPI
Indentation inside this block looks a little off.
Fixed in a57fdec
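The body of that #ifdef USE_MPI block is collapsed in this view. Purely as a hedged guess at the kind of reduction it might contain (not necessarily what the commit actually does), the per-rank sums of times could be combined before the average is formed:

#ifdef USE_MPI
  // Guess only: combine the per-rank sums of iteration times across ranks,
  // leaving the combined value on rank 0.
  double global_sum = 0.0;
  MPI_Reduce(&average, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  average = global_sum / (double)size;      // mean per-rank sum of times
#endif
  average /= (double)(num_times - 1);       // mean time per iteration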
Re the Makefile changes: I guess the ideal solution will depend on what changes are required.
Prevents lots of output in case of failure
ef22cbc implements this. Does this approach look OK? If so, I'll start to roll out similar changes to the other models.
Actually, piggybacking on
Yeah, looks fine to me, bar the
Closing, as we'd prefer a standalone MPI implementation. Opening a new issue to this effect.
Add the ability to run BabelStream across multiple nodes with MPI. The input size is assumed to be per MPI rank, so this behaves like weak scaling. The reported bandwidth is that of the entire parallel run.